-
Notifications
You must be signed in to change notification settings - Fork 19
[WIP] Ncu profile #368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Ncu profile #368
Conversation
Coverage reportClick to see where and how coverage changed
This report was generated by python-coverage-comment-action |
||||||||||||||||||||||||
248c411 to
ce0c6ec
Compare
small code improvements
ae72216 to
4928fc2
Compare
de51bbf to
779ee7f
Compare
16b4d0c to
d8d7145
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This is a work-in-progress PR that adds NVIDIA Nsight Compute (NCU) profiling support to complement the existing ROCm profiling capabilities. The changes refactor the profiling infrastructure to support multiple profilers and enable per-benchmark profiling runs.
- Adds NCU profiling implementation with filtered report generation
- Refactors profiling to run each benchmark separately and embed trace data in results
- Updates test infrastructure and workflows for new GPU runners
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| src/libkernelbot/run_eval.py | Adds NCU profiling function, refactors profile_program to use temporary directories, changes parameter order to use keyword-only arguments, and modifies run_evaluation to handle per-benchmark profiling |
| src/libkernelbot/report.py | Adds File dataclass for attaching profile files to reports, updates report generation to handle multiple profile runs, and adds _shortname helper for filename generation |
| tests/test_report.py | Updates tests to reflect new profile reporting structure with trace field and File attachments |
| examples/eval.py | Adds _run_single_profile_ncu function for NCU-specific profiling and updates profile selection logic based on POPCORN_NCU environment variable |
| src/kernelbot/discord_utils.py | Adds _send_file helper function and updates _send_split_log to add newlines properly and send messages silently |
| src/kernelbot/discord_reporter.py | Adds File handling to display_report for uploading profile attachments |
| src/kernelbot/api/api_utils.py | Adds check to block profile submissions via API and reformats code |
| src/runners/modal_runner.py | Updates Python version from 3.12 to 3.13 in CUDA image |
| .github/workflows/nvidia_workflow.yml | Updates runner to nvidia-docker-b200-8-x86-64, removes container configuration, and simplifies setup steps |
| .github/workflows/nvidia-arc-health.yml | Updates runner and removes Python/PyTorch installation steps |
| src/libkernelbot/launchers/github.py | Adds conditional check for profile-data artifact before setting download_url and reduces polling interval |
| scripts/ci_test_python.py | Updates test to use python3 command and keyword argument for system parameter |
| scripts/ci_test_cuda.py | Updates tests to use keyword argument for system parameter |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if _handle_crash_report(report, prof_run): | ||
| return report | ||
|
|
||
| if prof_run.profile.trace is not None: |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential AttributeError if prof_run.profile is None. The code checks prof_run.profile.trace is not None without first verifying that prof_run.profile itself is not None. Consider adding a null check:
if prof_run.profile is not None and prof_run.profile.trace is not None:| if prof_run.profile.trace is not None: | |
| if prof_run.profile is not None and prof_run.profile.trace is not None: |
| if submission_mode_enum == SubmissionMode.PROFILE: | ||
| raise HTTPException( | ||
| status_code=400, | ||
| detail="Profile submissions are not currently supported via API, use Discord instead.", | ||
| ) |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate check for SubmissionMode.PROFILE on lines 214 and 225. The first check already raises an exception if the mode is PROFILE, so the second check on line 225 is unreachable. Consider removing the duplicate check on lines 225-229.
| if submission_mode_enum == SubmissionMode.PROFILE: | |
| raise HTTPException( | |
| status_code=400, | |
| detail="Profile submissions are not currently supported via API, use Discord instead.", | |
| ) |
| - name: Create input files | ||
| shell: bash |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "Create input files" step runs apt-get commands without sudo and without a container context. The previous version used a container (image: nvidia/cuda:12.4.0-devel-ubuntu22.04), which provided a root environment. Without a container, these commands will likely fail on the self-hosted runner unless the runner is configured to run as root (which is a security risk). Consider either restoring the container setup or using sudo for the apt-get commands, or ensure jq is pre-installed on the runner.
| def log_one(base_name): | ||
| spec = run.result.get(f"{base_name}.spec") | ||
|
|
||
| report: str = run.result.get(f"{base_name}.report") |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential error if report is None. The code calls .encode() on report without checking if it exists first. If benchmark.{i}.report is not in the result dictionary, this will raise an AttributeError. Consider adding a check:
report: str = run.result.get(f"{base_name}.report")
if report is None:
return
report = base64.b64decode(report.encode("utf-8"), b"+*").decode("utf-8")| report: str = run.result.get(f"{base_name}.report") | |
| report: str = run.result.get(f"{base_name}.report") | |
| if report is None: | |
| return |
| """ | ||
| Profiles a single benchmark using ncu. Note: this does not | ||
| invoke NCU; instead, it is expected that eval is launched | ||
| under NCU, and this function will rurnthe kernel excactly |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in the docstring: "rurnthe" should be "run the".
| under NCU, and this function will rurnthe kernel excactly | |
| under NCU, and this function will run the kernel excactly |
| report.add_file( | ||
| f"profile-{_shortname(prof_run.run.result.get('benchmark.0.spec'))}.zip", | ||
| f"{prof_run.profile.profiler} report - " + prof_run.run.result.get("benchmark.0.spec"), |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential error if benchmark.0.spec is not in the result dictionary. The get() method will return None if the key doesn't exist, which will cause _shortname() to fail with an AttributeError when trying to call .replace() on None. Consider adding a default value or null check:
spec = prof_run.run.result.get('benchmark.0.spec', 'unknown')
report.add_file(
f"profile-{_shortname(spec)}.zip",
f"{prof_run.profile.profiler} report - {spec}",
base64.b64decode(prof_run.profile.trace),
)| report.add_file( | |
| f"profile-{_shortname(prof_run.run.result.get('benchmark.0.spec'))}.zip", | |
| f"{prof_run.profile.profiler} report - " + prof_run.run.result.get("benchmark.0.spec"), | |
| spec = prof_run.run.result.get('benchmark.0.spec', 'unknown') | |
| report.add_file( | |
| f"profile-{_shortname(spec)}.zip", | |
| f"{prof_run.profile.profiler} report - {spec}", |
|
|
||
| if prof_run.profile.trace is not None: | ||
| report.add_log( | ||
| f"Profiling {prof_run.run.result.get('benchmark.0.spec')}", |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential error if benchmark.0.spec is not in the result dictionary. The get() method will return None if the key doesn't exist, which will be interpolated as the string "None" in the log header. Consider adding a default value:
f"Profiling {prof_run.run.result.get('benchmark.0.spec', 'unknown benchmark')}"| f"Profiling {prof_run.run.result.get('benchmark.0.spec')}", | |
| f"Profiling {prof_run.run.result.get('benchmark.0.spec', 'unknown benchmark')}", |
| with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof: | ||
| with nvtx_range("custom_kernel"): | ||
| submission_output = custom_kernel(_clone_data(data, 0)) | ||
| submission_output = custom_kernel(cloned) |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable submission_output is not used.
|
|
||
| cloned = _clone_data(data, 0) | ||
| with nvtx_range("custom_kernel"): | ||
| submission_output = custom_kernel(cloned) |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable submission_output is not used.
| submission_output = custom_kernel(cloned) | |
| custom_kernel(cloned) |
| ] + call | ||
|
|
||
| run_result = run_program( | ||
| call, seed=seed, timeout=timeout, multi_gpu=multi_gpu, extra_env={"POPCORN_NCU": "1"} |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'except' clause does nothing but pass and there is no explanatory comment.
No description provided.